
    Re-Pair Compression of Inverted Lists

    Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: we use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that additionally speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, and conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, although further improvements are required for them to improve upon the state of the art.
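
    To make the idea of applying Re-Pair to the differences concrete, the following is a minimal, naive sketch and not the paper's data structure. It assumes a strictly increasing posting list; the function names (to_gaps, repair, expand, from_gaps) are hypothetical, and the quadratic pair counting is for illustration only.

```python
from collections import Counter
from itertools import accumulate


def to_gaps(postings):
    """Turn a strictly increasing posting list into differences (gaps)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def from_gaps(gaps):
    """Recover the posting list from its gaps via prefix sums."""
    return list(accumulate(gaps))


def repair(seq, first_new_symbol):
    """Naive Re-Pair: replace the most frequent adjacent pair by a new
    symbol until no pair occurs at least twice. Returns (sequence, rules)."""
    seq = list(seq)
    rules = {}
    next_symbol = first_new_symbol  # must not collide with any gap value
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            # Left-to-right, non-overlapping replacement of the chosen pair.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules


def expand(seq, rules):
    """Expand every nonterminal back into the original gap sequence."""
    out = []
    stack = list(reversed(seq))
    while stack:
        symbol = stack.pop()
        if symbol in rules:
            left, right = rules[symbol]
            stack.append(right)
            stack.append(left)
        else:
            out.append(symbol)
    return out


postings = [3, 5, 8, 10, 13, 15, 18, 20]
gaps = to_gaps(postings)
compressed, rules = repair(gaps, max(gaps) + 1)
assert from_gaps(expand(compressed, rules)) == postings
```

    In this toy example the gap sequence is periodic, so a couple of rules capture most of it. The sketch does not attempt the skipping or intersection support that the abstract mentions.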

    Compact Binary Relation Representations with Rich Functionality

    Binary relations are an important abstraction arising in many data representation problems. The data structures proposed so far to represent them support just a few basic operations required to fit one particular application. We identify many of those operations arising in applications and generalize them into a wide set of desirable queries for a binary relation representation. We also identify reductions among those operations. We then introduce several novel binary relation representations, some simple and some quite sophisticated, that are not only space-efficient but also efficiently support a large subset of the desired queries. Comment: 32 pages

    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
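
    As a rough illustration of why run-length compression of differential lists can help on versioned collections, the sketch below assumes that a term occurring in consecutive versions produces runs of identical gaps (typically 1). It is a simplified stand-in, not the encoding used in the paper, and the names rle_gaps and decode_rle are hypothetical.

```python
from itertools import accumulate, groupby


def rle_gaps(postings):
    """Gap-encode a strictly increasing posting list, then collapse runs of
    equal gaps into (gap, run_length) pairs."""
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]


def decode_rle(runs):
    """Invert rle_gaps: expand the runs and take prefix sums."""
    gaps = [gap for gap, length in runs for _ in range(length)]
    return list(accumulate(gaps))


# Hypothetical example: a term present in document versions 10-14 and 40-42.
postings = [10, 11, 12, 13, 14, 40, 41, 42]
runs = rle_gaps(postings)
assert decode_rle(runs) == postings
```

    Here eight postings collapse to four (gap, run_length) pairs; the saving grows with the length of each run of consecutive versions containing the term.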

    Molecular Control of Fruit Ripening and Sensory Quality of Charentais Melon

    Traditional Charentais melons have a typical climacteric behavior, with ethylene playing a major role in the regulation of the ripening process. Genetic studies using climacteric and non-climacteric types of Cucumis melo demonstrated that the climacteric character is dominant and conferred by only two duplicated loci, which are of great importance for the regulation of storability and sensory quality. Commercial varieties of Charentais melon with long shelf-life have been generated, some of them by crossing with a non-ripening Charentais genotype (Vauclusien). The introduction of the long shelf-life character resulted in an undesirable loss of aroma volatile production. The inhibition of ethylene synthesis by knocking down ACC oxidase gene expression has been achieved in Charentais melon. This results in a strong inhibition of the synthesis of aroma volatiles, while the accumulation of sugars is not affected or is even improved, and the softening of the flesh is strongly affected but not abolished. It was also demonstrated that ethylene-inhibited fruit exhibited better resistance to chilling injury. Due to the importance of aroma volatiles in sensory quality and to the strong negative correlation between aroma production and ethylene synthesis, we have developed a research program aimed at isolating genes involved in the synthesis of volatile esters, compounds that are essential for the flavor of Cantaloupe melons. We report here on the recent advances in the field, with special emphasis on the characterization of two families of genes encoding aldehyde reductases and alcohol acyl transferases.

    Characterization of Genes Involved in the Formation of Aroma Volatiles in "Charentais" Melon Fruit

    Volatile esters impart distinct characteristics to fruit quality. "Charentais" cantaloupe melon (Cucumis melo "cantalupensis") is characterized by abundant sweetness and aromatic flavour. Plant alcohol acyl transferase (AAT) genes have been identified and shown to be involved in aroma production. Recently, two cDNAs (Cm-AAT1 and Cm-AAT2) putatively involved in the formation of aroma volatile esters have been isolated from melon fruit. The Cm-AAT1 protein exhibits alcohol acyl transferase activity, while no such activity could be detected for Cm-AAT2. Two new cDNAs (Cm-AAT3 and Cm-AAT4) have been isolated from melon fruit that showed 69% and 36% similarity, respectively, with Cm-AAT1. The percentage similarity over the whole amino acid sequence between them is 34%. Cm-AAT3 and Cm-AAT4 show the highest similarity to the tobacco Nt-HSR201 protein and a rose alcohol acyltransferase Rh-AAT1, respectively. All Cm-AAT genes share three conserved regions common to the BAHD acyltransferase gene superfamily. Heterologous expression in yeast revealed that some of the encoded proteins have a wide range of specificity while others are specific to a narrow range of substrates.

    Mechanisms of Fruit Ripening: Retrospect and Prospects

    This paper aims to give an overview of the progress made during the last decades on the mechanisms of fruit ripening and to present the most recent trends and prospects for the future. Important steps forward will be presented (respiratory climacteric, ethylene biosynthesis and action, isolation of genes involved in the ripening process, biotechnological control of fruit ripening, ...) by showing how the judicious exploitation of previously published data, together with the strategies, methodologies and plant material adopted, has been crucial for the advancement of knowledge. Opportunities for co-operation between geneticists and post-harvest physiologists, as well as new possibilities offered by genomics, proteomics and metabolomics for the understanding of the fruit ripening process and the development of sensory quality, will be discussed.

    The CAMOMILE collaborative annotation platform for multi-modal, multi-lingual and multi-media documents

    In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analysis which can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated to a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored specifically to the needed task can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed in open source. Peer Reviewed. Postprint (author's final draft)

    Space-Efficient Data Structures for Information Retrieval

    The amount of data that people and companies store has grown exponentially over the last few years. Storing this information alone is not enough, because in order to make it useful we need to be able to efficiently search inside it. Furthermore, it is highly valuable to keep the historic data of each document stored, allowing users not only to access and search inside the newest version, but also over the whole history of the documents. Grammar-based compression has proven to be very effective for repetitive data, which is the case for versioned documents. In this thesis we present several results on representing textual information and searching in it. In particular, we present text indexes for grammar-based compressed text that support searching for a pattern and extracting substrings of the input text. These are the first general indexes for grammar-based compressed text that support searching in sublinear time. In order to build our indexes, we present new results on representing binary relations in a space-efficient manner, and construction algorithms that use little space to achieve their goal. These two results have a wide range of applications. In particular, the representations for binary relations can be used as a building block for several structures in computer science, such as graphs, inverted indexes, etc. Finally, we present a new index, which relies on grammar-based compression, to solve the document listing problem. This problem deals with representing a collection of texts and searching for the documents that contain a given pattern. In spite of being similar to the classical text indexing problem, this problem has proven to be a challenge when we do not want to pay time proportional to the number of occurrences, but time proportional to the size of the result. Our proposal is designed particularly for versioned text, allowing the storage of a collection of documents with all their historic versions in little space. This is currently the smallest structure for such a purpose in practice.